Python dataframes with pandas and polars

Andreas Beger and Isaac Chung
PyData Tallinn x Python CodeClub
27 November 2024

Bios

Andreas Beger

  • 🏢 Data Scientist, Consult.
  • 🏃‍♂️🐌 Slow marathoner
  • 📍 🇩🇪/🇭🇷 → 🇺🇸 → 🇪🇪
  • 🎓 PhD Political Science

Isaac Chung

  • 🏢 Staff Data Scientist, Wrike
  • 🏊‍♂️🚴🏃‍♂️ Fast triathlete
  • 📍 🇭🇰 → 🇨🇦 → 🇪🇪
  • 🎓 MS Machine Learning

🐍 We are also the PyData Tallinn co-organizers.

Getting set up

Instructions for how to follow along in notebooks…GitHub codespaces?

What are dataframes?

Definition

Tables, 2d arrays, etc.

Why?

Show a list of python dictionaries

vs

pandas data frame
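
A minimal sketch of that comparison. The records below are made up for illustration and are not the workshop data; the same question is answered once with plain Python and once with a dataframe.

```python
import pandas as pd

# Made-up records, purely for illustration
records = [
    {"city": "Tallinn", "year": 2023, "accidents": 10},
    {"city": "Tartu", "year": 2023, "accidents": 4},
    {"city": "Tallinn", "year": 2024, "accidents": 7},
]

# Plain Python: filtering and summing needs an explicit comprehension
total_plain = sum(r["accidents"] for r in records if r["city"] == "Tallinn")

# pandas: the same records as a table, with column-wise operations
records_df = pd.DataFrame(records)
total_pandas = records_df.loc[records_df["city"] == "Tallinn", "accidents"].sum()

print(total_plain, total_pandas)
```

Both approaches give the same total; the dataframe version starts to pay off once the questions involve several columns, groups, or joins.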

Common dataframe operations

  • 📖 ✍️ read and write
  • 🔬 inspect
  • 🛒 select columns
  • 🔍 filter rows
  • 🥪 mutate, add columns
  • 👨‍👩‍👧‍👦 group and aggregate
  • 🤝 join other dataframes
  • 🧱 reshape wide, long

Section 1: pandas

History

Wes McKinney

  • 2008
  • Originally built on top of NumPy
  • pandas 2 adds support for an Arrow backend

Getting started

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "quarter": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "x": np.random.randn(12),
    "date": pd.date_range("2024-01-01", periods=12, freq="MS")
})

df.head()
   quarter         x       date
0        1  1.247609 2024-01-01
1        1  0.508534 2024-02-01
2        1  2.186492 2024-03-01
3        2 -0.416343 2024-04-01
4        2 -0.272219 2024-05-01

Components of a dataframe

Series

df.x
0     1.247609
1     0.508534
2     2.186492
3    -0.416343
4    -0.272219
5    -0.486202
6     0.285188
7    -1.197386
8     2.361205
9     0.764677
10    2.414473
11    0.334568
Name: x, dtype: float64

Columns

df.columns
Index(['quarter', 'x', 'date'], dtype='object')

Index

df.index
RangeIndex(start=0, stop=12, step=1)
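
How the three components fit together can be sketched on a small frame. The frame here is a shortened, hypothetical version of the one above, with fixed values instead of random ones.

```python
import numpy as np
import pandas as pd

# Shortened stand-in for the example frame, with fixed values
small = pd.DataFrame({
    "quarter": [1, 1, 2, 2],
    "x": np.arange(4, dtype="float64"),
})

# A single column is a Series; bracket and attribute access are equivalent here
col = small["x"]
print(type(col).__name__)
print(col.equals(small.x))

# The Series carries the same row index as the dataframe it came from
print(col.index.equals(small.index))

# The column labels are themselves stored in an Index object
print(list(small.columns))
```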

Input - reading data from somewhere else

accidents = pd.read_csv("../data/estonia-traffic-accidents-clean.csv")

Inspecting

accidents.shape
(14259, 8)
accidents.columns
Index(['date', 'persons_involved', 'killed', 'injured', 'county',
       'pedestrian_involved', 'accident_type', 'light_conditions'],
      dtype='object')
accidents.head()
                  date  persons_involved  killed  injured         county
0  2014-10-24 08:45:00                 2       0        1  Harju maakond
1  2014-10-24 13:45:00                 2       0        1  Harju maakond
2  2014-08-11 00:00:00                 2       0        1  Harju maakond
3  2014-11-17 17:32:00                 2       0        2  Harju maakond
4  2015-04-28 07:55:00                 2       0        1  Harju maakond

Inspecting

accidents.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14259 entries, 0 to 14258
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   date                 14259 non-null  object
 1   persons_involved     14259 non-null  int64 
 2   killed               14259 non-null  int64 
 3   injured              14259 non-null  int64 
 4   county               14259 non-null  object
 5   pedestrian_involved  14259 non-null  int64 
 6   accident_type        14259 non-null  object
 7   light_conditions     14259 non-null  object
dtypes: int64(4), object(4)
memory usage: 891.3+ KB

Selecting columns

accidents["date"].head()
0    2014-10-24 08:45:00
1    2014-10-24 13:45:00
2    2014-08-11 00:00:00
3    2014-11-17 17:32:00
4    2015-04-28 07:55:00
Name: date, dtype: object
accidents[["date", "county"]].head()
                  date         county
0  2014-10-24 08:45:00  Harju maakond
1  2014-10-24 13:45:00  Harju maakond
2  2014-08-11 00:00:00  Harju maakond
3  2014-11-17 17:32:00  Harju maakond
4  2015-04-28 07:55:00  Harju maakond

Mutating columns

accidents["date"] = pd.to_datetime(accidents["date"])
accidents["date"].head()
0   2014-10-24 08:45:00
1   2014-10-24 13:45:00
2   2014-08-11 00:00:00
3   2014-11-17 17:32:00
4   2015-04-28 07:55:00
Name: date, dtype: datetime64[ns]
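
Filtering rows and group-and-aggregate from the list of common operations could look roughly like this. To stay self-contained, the sketch uses a toy frame with some of the same column names as the accidents data; the values are invented and the results say nothing about the real dataset.

```python
import pandas as pd

# Toy stand-in for the accidents data: same column names, invented values
toy = pd.DataFrame({
    "date": pd.to_datetime([
        "2014-10-24 08:45:00", "2014-11-17 17:32:00",
        "2015-04-28 07:55:00", "2015-06-01 12:00:00",
    ]),
    "county": ["Harju maakond", "Harju maakond", "Tartu maakond", "Tartu maakond"],
    "killed": [0, 1, 0, 0],
    "injured": [1, 2, 1, 3],
})

# Filter rows: keep accidents from 2015 onwards using a boolean mask
recent = toy[toy["date"] >= pd.Timestamp("2015-01-01")]

# Mutate: add a column derived from the parsed datetime column
toy["year"] = toy["date"].dt.year

# Group and aggregate: total injured per county
injured_by_county = toy.groupby("county")["injured"].sum()

print(len(recent))
print(injured_by_county)
```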

pandas is great


2017, Wes McKinney (creator of pandas):

10 Things I Hate About Pandas

  • Inefficient memory management: rule of thumb is RAM of 5-10x the dataset size
  • Eager evaluation → limited query planning
  • No multi-core support

Section 2: polars

History

2020 Ritchie Vink

Uses Arrow as its internal representation

(Arrow was co-created by Wes McKinney in 2016!)

new slides

  • Out with indices
  • Out with .loc, .iloc
  • Out with [
  • In with lazy evaluation
  • Expressions

Easy to convert between the two

import polars as pl

pl_df = pl.from_pandas(df)   # pandas → polars
pd_df = pl_df.to_pandas()    # polars → pandas

example

go through same example again, but with polars

The big picture

Andy is a polars stan

Comparison

pandas

  • ✅ Very widely used and supported
  • ✅ Stable
  • ❓ More imperative, traditional API
  • ❌ Inconsistent API, multiple ways of doing the same thing

polars

  • ✅ More consistent, functional-style API
  • ✅ Faster, smaller memory footprint
  • ✅ Works with larger-than-memory datasets out of the box
  • ❌ API still changing

Other frameworks

  • Narwhals
  • DuckDB

Thank you!

Scan this and let us know how we did 🤗